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Abstract 

We propose the use of l\ regularization in a wavelet basis for the solution of linearized seismic 
tomography problems Am = d, allowing for the possibility of sharp discontinuities superimposed 
on a smoothly varying background. An iterative method is used to find a sparse solution m that 
contains no more fine-scale structure than is necessary to fit the data d to within its assigned errors. 
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1 Introduction 

Like most geophysical inverse problems, the linearized problem Am = d in seismic tomography is un- 
derdetermined, or at best offers a mix of overdetermined and underdetermined parameters. It has there- 
fore long been recognized that it is important to suppress artifacts that could be falsely interpreted 
as 'structure' in the earth's interior. Not surprisingly, strategies that yield the smoothest solution m 
have been dominant in most global or regional tomographic applications; these strategies include seek- 
ing global models represented as a low-degree spherical harmonic expansion Dziewonski et al.(1975)| 



|Dziewonski fc Woodhouse(1987) Masters et al.(1996)| as well as regularization via minimization of the 



gradient (Vm) or second derivative (V^m) norm of a d ense local parametrization [Nolet(1987)[ Constable et al.(1987)| 



Spakman fc Nolet(1988)"| |VanDecar fc Snieder(1994)HTrampert fc Snieder(1996) 



Smooth solutions, however, while not introducing small-scale artifacts, produce a distorted image 
of the earth through the strong averaging over large areas, thereby making small-scale detail diffi- 
cult to see, or even hiding it. Sharp discontinuities are blurred into gradual transitions. For ex- 
ample, the inability of global, spherical-harmonic, tomographic models to yield as clear an image of 
upper-mantle subduction zones as produced by more localized studies has long been held against them. 
[Deal et al.(1999)| and [Deal fc Nolet(1999)| optimize images of upper-mantle slabs to fit physical models 
of heat diffusion, in an effort to suppress small-scale imaging artifacts while retaining sharp boundaries. 
[Portniaguine fc Zhdanov(1999) use a conjugate-gradient method to seek the smallest possible anoma- 
lous domain by minimizing a norm based on a renormalized gradient Vm/(Vm ■ Vm + j 2 ) 2 , where 7 is 
a small constant. Like all methods that deviate from a least-squares type of solution, both these methods 
are nonlinear and pose their own problems of practical implementation. 

The notion that we should seek the 'simplest' model m that fits a measured set of data d to within 
the assigned errors is intuitively equivalent to the notion that the model should be describable with a 
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small number of parameters. But, clearly, restricting the model to a few low-degree spherical-harmonic or 
Fourier coefficients, or a few large-scale blocks or tetrahedra, does not necessarily lead to a geophysically 
plausible solution. In this paper we investigate whether a multiscale representation based upon wavelets 
|Daubechies (1992]] has enough flexibility to represent the class of models we seek. We propose an l\- 
norm regularization method which yields a model m that has a strong tendency to be sparse in a wavelet 
basis, meaning that it can be faithfully represented by a relatively small number of nonzero wavelet 
coefficients. This allows for models that vary smoothly in regions of limited coverage without sacrificing 
any sharp or small-scale features in well-covered regions that are required to fit the data. Our approach is 
different from an approach briefly suggested by |de Hoop fc van der Hilst(2005) , in which the mapping 
between data and model is decomposed in curvelets: here we are concerned with applying the principle 
of parsimony to the solution of the inverse problem, without any special preference for singling out linear 
features, for which curvelets are probably better adapted than wavelets. 

In Section 2 we give a short description of the mathematical method, and in Section 3 we con- 
sider a geophysically motivated, toy 2D application, in which the synthetic data are a small set of 
regional, fundamental-mode, Rayleigh-wave dispersion measurements expressed as wavenumber pertur- 
bations Sk(v) at various frequencies v. To enable us to concentrate on the mathematical rather than the 
geophysical aspects of the inverse problem, we assume that the fractional shear-velocity perturbations 
(51n/3 = 5(3/(3 within the region are depth- independent. Finite-frequency interpretation of the surface- 
wave dispersion data Zhou et al.(2004) then yields a 2D linearized inverse problem of the form Am = d. 
We compare wavelet-basis models m obtained using our proposed i^-norm regularization with models 
obtained using more conventional £ 2 regularization, both with and without wavelets, and show that the 
former are sparser and have fewer small-scale artifacts. 



2 Mathematical principles 

In any realistic tomographic problem, the linear system Am = d is not invertible: even when the 
number of data exceeds the number of unknowns, the least-squares matrix A T A is (numerically) singular. 
Additional conditions always have to be imposed. The proposed regularization method is based on the 
fundamental assumption that the model m is sparse in a wavelet basis Daubechies ( 1992)] . We believe 



that this is an appropriate inversion philosophy for finding a smoothly varying model while still allowing 
for whatever sharp or small-scale features are required to fit the data d. An important feature of the 
method is that the location of the small-scale features does not have to be specified beforehand. 

A wavelet decomposition is a special kind of basis transformation that can be computed efficiently 
(the number of operations is proportional to the number of components in the input). At each step 
the algorithm strips off detail belonging to the finest scale present — this detail is encoded in wavelet 
coefficients, broadly corresponding to local differences — and calculates a coarse version — encoded in 
scaling coefficients, broadly corresponding to local averages — that is only half the size of the original in 
ID and only one quarter the size in 2D. This procedure is repeated on the successive coarse versions. The 
resulting wavelet coefficients (at the different scales) and scaling coefficients (at the final coarsest scale 
only) are called the wavelet decomposition of the input. By this construction each wavelet coefficient 
carries information belonging to a certain scale (by virtue of the decimation) and a certain position (use 
of local differences). The final few scaling coefficients represent a (very) coarse average. 

The mathematical relation between the wavelet-basis expansion coefficients w and the model m 
is the wavelet transform W (a linear operator): m = W T w. By choosing the local differences and 
averages carefully (corresponding to a choice among many different so-called wavelet families), the inverse 
transformation from w back to m can be made equally efficient. In our application we will use a special 
kind of 2D wavelet basis that is overcomplete: it contains six different wavelets corresponding to different 
directions. Because of this overcompleteness, the wavelet transform W has a left inverse (namely W T ): 
W T W = I, but no right inverse, WW T ^ I. Appendix iBl contains a short overview of this particular 
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Figure 1: Sparsity, l\ minimization and £2 minimization: Left: Because the ^i-ball has no bulge, the solu- 
tion with smallest ^i-norm is sparser than the solution with smallest ^2-norm. Right: A \w\ penalization 
effects small coefficients more and large coefficients less than the (traditional) w 2 penalization. 



construction. In short our wavelet and scaling coefficients w contain information on scale, position and 
direction. 

For the tomographic reconstruction, we will require a sparse set of wavelet-basis coefficients: the vast 
majority of these represent differences and will only be present around non-smooth features. In this way 
we regularize the inversion by adapting ourselves to the model rather than to the operator. As a measure 
of sparsity we will use the ^-norm of the wavelet representation w of the model m, i.e. we will look for a 
solution of the linear equations Am = d that has a small ||w||i = J2i \ w i\- Since \wi\ > \wi\ 2 for small Wi 
and \wi\ < \wi\ 2 for large Wi, this type of penalization will favor a small number of large coefficients over 
a large number of small coefficients in the reconstruction (whereas a traditional £2 penalization might 
do the opposite). We are not claiming that the sparsest solution always coincides with the minimum 
<?i-norm solution, but one can show that it often does Donoho(2004) Candes et al.(2006)| . A schematic 
justification for this is given in Fig.^ 

In particular, our strategy will consist of searching for the minimizer of the functional 

ii(w) = ||d- Am||^ + 2r||w||i = ||d - AW T w||^ + 2r||w|| 1; (I) 

where r is an adjustable parameter at our disposal. Here, the first (quadratic) term corresponds to the 
conventional statistical measure of misfit to the data, x 2 = Y^A^i ~ (Am)i] 2 , and the second (£i-norm) 
term 2r||w||i is introduced to regularize the inversion. In writing \ 2 m this form, we have made the 
simplifying assumption that the noisy data d are uncorrelated with unit variance. More generally, the 
misfit portion of the functional Q is x 2 = (d — Am) T E _1 (d — Am), where £ is the data covariance 
matrix. In the 2D toy problem considered in Section 3, we invert synthetic data d having a constant (but 
non-unit) variance, S = a 2 I. 

The minimizer of the functional JIJ can be found by iteration |Daubechies et al.(2004)| : starting with 
the present approximation one constructs an nth-iterate surrogate functional 

l[ n) (w) =/i(w)- ||AW T (w- w^)||2 + ||w- w^H?, (2) 
that has the same value and the same derivative at the point w = w'™- 1 as the original functional (see 
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Fig. [21) - This surrogate functional can be rewritten as 



w - WA' d + (I WA' AW' )w 



(n) 



■2t||w||i 



(3) 



where is independent of w. This functional has a much simpler form than the original -Zi(w) because 
there is no operator AW mixing different components of w. The next approximation 

w («+i) i s 

defined 

by the minimizer of this new functional. By calculating the derivative of expression © with respect 
to a specific wavelet or scaling coefficient «;,, one finds the following set of component-by-component 
equations: 

Wi - (\VA T d + (I - WA T AW T )w (n) ) + t sign(ioi) = 0, (4) 

valid whenever Wi ^ 0. These equations are solved by distinguishing the two cases Wi > and Wi < 0; the 
solution — corresponding to the minimizer of the surrogate functional l[ n \w), and denoted by w(™ +1 ) — 
is then found to equal 



WA 1 d + (I - WA 1 AW 1 )w 



{n) 



(5) 



where S T is the so-called soft-thresholding operation, i.e. 



S T (w) 




W > T 
\w\ < T 
W ■' — T, 



(6) 



performed on each wavelet or scaling coefficient Wi individually. The starting point of the iteration 
procedure is arbitrary, e.g. w( ) = 0. Because of the component-wise character of the tresholding, it 
is straightforward to use different thresholds tj for different components Wj if desired, and in fact we 
shall use different thresholds r w and r s for the wavelet and scaling coefficients in our application. A 
schematic representation of the idea behind the iteration is given in Fig. [21 We realize that this 
iteration converges slowly for ill-conditioned matrices, but we use it here because it is proven to converge 



to the solution Daubechies et al.(2004) 



An improvement in convergence can be gained by rescaling the operator A (and rescaling the data 
d at the same time) in such a way that the largest eigenvalue of a 2 A T A is close to (but smaller than) 
unity. The iteration corresponding to the minimization of this new, rescaled functional is 



>+i) 



S Ta 2 a 2 WA T d + (I - a 2 WA T AW>(»> 



(7) 



We will also make use of the following two-step procedure: from the outcome m = W T w of the iteration 
Q , we define new, linearly shifted data d' = 2d — Am and restart the same iteration with this new data: 



a 2 WA T d' + (I - a 2 WA T AW T )w 



(„.) 



,(0) 



(8) 



The outcome m = W T w of this second iteration is then the final, regularized reconstruction of the 
model. For the same value of the regularization parameter r, the second step improves the data fit 
considerably, |jd — Am|| 2 < ||d — Amj| 2 ; hence a given level of final data fit \ 2 will, in the two-step 
procedure, correspond to a higher value of r. Because r specifies the threshold level, a higher value will 
lead to more aggressive thresholding and thus faster convergence to a sparse solution. 

The above method will be demonstrated in the next section and compared to a conventional £2- 
regularization method, in which the functional 



/ 2 (m) 



||d-Am|| 2 + r||m|| 2 



(9) 
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Figure 2: Loft: The functional ii(w) is approximated in the vicinity of w^™) by a surrogate functional 
l[ n \w), constructed in such a way that its minimum is easy to find (eq. (J5J). This defines the next step 
in the iteration. Right: Soft thresholding function S T (w). 



is minimized (the crucial difference with ii(w) being the second term). This gives rise to the familiar 
system of damped normal equations 

(A T A + rl)m = A T d, (10) 

whose solution m = (A T A + rI) _1 A T d can be found using a linear solver of choice, since A T A + rl 
is a regular matrix. To emphasize the similarities and differences with the £\ method, we adopt the 
classical Landweber iteration [Landweber(1951) that can be (but in modern applications seldom is) used 
for solving the linear equations 



m 



= A T d+ [I- (A T A + rI)]mW, m( 0) =0. (11) 



No thresholding is employed here. Rescaling of the operator and the data again improves the rate of 
convergence: 

m(" +1 ) = a 2 A T d + [l-(a 2 A T A + ™ 2 I)]m(™\ = 0. (12) 

Of course it is also possible to solve the linear system (|10H using a conjugate-gradient or similar algorithm 
in much less time. 

A third option is to use an £2 penalization on the wavelet coefficients. This allows us to penalize 
the scaling coefficients differently than the wavelet coefficients (with the help of different penalization 
parameters r s and r w ). We can use the following iteration, similar to formula (|12|l . but now in the wavelet 
domain: 



w (n+l) = a 2 WA T d . 



I - (a 2 WA T AW T + a 2 I) w< n >, w<°> = 0, (13) 



where I acts as the r w x identity on wavelet coefficients and as the t s x identity on scaling coefficients. 
If we were to use an orthonormal wavelet basis (W T W = WW T = I) for our expansions, and if we 
penalized every coefficient the same, r w = t s = r, then this method would be identical to the previous £2 
method. 

In the following we consider both one-step and two-step £\ wavelet penalization as well as conventional 
£2 penalization, both without and with wavelets, using the rescaled iterative schemes 0, ||SJ, <|12[) and (|13fl 
for the purposes of comparison. 
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3 Implementation 



To test the above ideas, we devised a dramatically simplified, two-dimensional, synthetic surface- wave in- 
version problem very loosely modeled after an actual Passcal deployment in Tanzania Owens et al.(1995)| . 



Fig- El (left) shows the hypothetical experimental setup: the highly schematized input model consists of a 
sharp, bent, East African rift structure with low shear-wave velocity, <51n/3(x, y) < 0, superimposed upon 
a smooth, circular cratonic positive anomaly, 5hi(](x,y) > 0. Eleven earthquake events (circles) were 
taken from the NEIC catalogue to mimic realistic regional seismicity for the duration of a typical tempo- 
rary deployment of the twenty-one stations (triangles). The locations of the seismic stations and events 
are listed in Tabled For each of the 11 x 21 source-receiver paths, we assume that fundamental-mode 
Rayleigh-wave perturbations Sk(y) have been measured at eight selected frequencies between v « 0.01 Hz 
and v w 0.1 Hz. These wavenumber perturbations are related to the 2D, depth-independent velocity per- 
turbations <51n/3(x, y) via a 2D, frequency-dependent sensitivity kernel (see Appendix A for more details): 

Sk(v) = jj K 2D (x,y, u)8\nfi{x,y) dx dy. (14) 

Plots of the lowest-frequency (y ss 0.01 Hz) and highest-frequency (y ~ 0.1 Hz) kernel Kzd(x, y, v) for a 
typical source-receiver pair are shown in the left two panels of Fig- EJ Because finite-frequency scattering 
and diffraction effects are accounted for in the kernels Kzd(x, y,v), there is significant off-path sensitivity 
of the measurements Sk(v) within the first one or two Fresnel zones |Zhou et al. (2004)|. All kernels 



K2u{x,y, v) and distances are computed in the flat-earth earth approximation. 

The study region, which is 35° (north-south) by 25° (east- west), is subdivided into N x xN y = 64x64 = 
4096 equal-sized rectangles, and the discretized model vector m consists of the unknown constant values 
of 51n/3(x, y) within each rectangle. To compute the matrix A, which maps the discretized model m onto 
the data d (consisting of multiple 8k(y)), each kernel K2T)(x,y, v) is sampled n x x n v times on each of 
the N x x N y model- vector rectangles and a Riemann sum is used to compute the quantity 



Ax Ay x ~ 

rectangle(fe,;) Tl x n y ^ ^ 



if 2D (x, y, v) dx dy « — — i ^ AT 2D (xi - Ax/2 + (m - l/2)8x, y k - Ay/2 + (n- l/2)Sy, v) , 



(15) 

where Sx = Ax/n x and Sy — Ay/n y . A schematic representation of the N x x N y grid on which the 
spatial-domain model m is specified and the n x x n y integration subgrid is shown in Fig. [3] We choose 
n x = n y — 32 since we have found that doubling this to n x = n y = 64 yields a change of less than one 
percent in the integrated value of A. The dimensions of the resulting matrix A are 1848 (number of 
stations x number of events x number of wavenumbers) by 4096 (number of model- vector pixels) . To 
give an idea of the overall degree of coverage, we have plotted the sum (over all station event pairs) of the 
absolute value of all of the lowest-frequency and all the highest-frequency discretized kernels in the right 
two panels of Fig.|5| It is clear that much of the study area, particularly in the northwest and southeast, 
is completely uncovered (as is typical of real- world, regional seismic experiments). 

Using the matrix A and the input model m mput with a sharp, low-velocity East African rift superim- 
posed on a broad, high- velocity cratonic structure, we compute synthetic data d = Am mput + e, where we 
have added Gaussian noise e with zero mean and a standard deviation equal to two percent of the largest 
synthetic wavenumber perturbation, i.e. a — 0.02 max(|Am lnput |). By adopting a constant standard 
deviation a, errors at the highest frequency v and unperturbed wavenumber k(y) are more than an order 
of magnitude smaller than those for the lowest frequency and wavenumber data, where the signal-to-noise 
ratio may be close to unity. Since finite-frequency inversions include the effect of scattered wave energy, 
a high precision of the measurement Sk(v) at high frequency v is realistic. The purpose of the proposed 
algorithm is now to reconstruct m from the knowledge of the noisy data d, the matrix A and the linear 
equations Am = d. 
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Figure 3: Schematic representation of the 2D Cartesian grids used. Left: N x x N y grid used to specify 
the model m. Right: Blowup of the finer-scale n x x n y grid used to compute the kernel matrix A via 
the approximate integration l|15fl . Since the study region is rectangular in shape (see Fig. 0J and since 
N x = N y and n x = n yi the actual Ax x Ay model pixels and Sx x Sy integration subpixels are also 
rectangular, rather than square as shown. 



Table 1: List of positions of seismic stations and earthquake events used in the synthetic inversion. 



Stations 



Events 



longitude 


latitude 


longitude 


latitude 


33.3203° 


-7.9073° 


29.02° 


-1.86° 


35.1382° 


-4.3238° 


49.10° 


12.85° 


32.7712° 


-9.2958° 


44.15° 


11.80° 


33.2588° 


-8.1060° 


30.82° 


-7.84° 


29.6927° 


-4.8392° 


46.34° 


12.33° 


38.6170° 


-5.3018° 


39.17° 


19.02° 


30.3988° 


-5.1168° 


32.78° 


5.06° 


36.5695° 


-5.3223° 


44.15° 


14.57° 


37.4763° 


-5.3775° 


28.84° 


1.16° 


36.7192° 


-3.8422° 


40.33° 


14.20° 


35.7965° 


-4.9040° 


33.67° 


-3.05° 


36.6983° 


-2.7252° 






34.3462° 


-4.9610° 






34.0560° 


-6.0192° 






35.4007° 


-5.2508° 






33.2415° 


-8.9835° 






33.1842° 


-4.7145° 






33.5180° 


-6.9372° 






34.7315° 


-4.6403° 






36.0163° 


-3.8892° 






32.0832° 


-5.0878° 
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Figure 4: From left to right: Toy 2D velocity model for the East African rift and adjacent continental 
craton, showing the seismic stations (triangles) and earthquake events (circles); reconstructed model using 
the two-step ^-penalization method; reconstruction using the spatial-domain 1% method; reconstruction 
using the wavelet-domain £ 2 method. The two-step l\ model is that obtained after 1000+1000 iterations, 
whereas both £ 2 models are after 2000 Landweber iterations. Red denotes low anomalous velocity, 
8\n(3(x,y) < 0, and blue denotes high velocity, S\n(3(x,y) > 0. The absolute magnitude \5\n(3(x,y)\ is 
irrelevant, since the inverse problem Am — d is linear and the synthetic data are constructed from the 
input model m mput (leftmost map) via d = Am mput + e. 



0.01 Hz 0.1 Hz 0.01 Hz 0.1 Hz 




30° 40° 50° ~ 30° 40° 50° 30° 40° 50° ~ 30° 40° 50° 



Figure 5: Left: Map view of a typical two-dimensional sensitivity kernel K2u{x,y, v) at the lowest 
frequency considered, v k, 0.01 Hz. Second from left: Highest-frequency (y w 0.1 Hz) kernel for the 
same source-receiver path. Both kernels exhibit structure on a much finer scale than the resolution of 
the model, necessitating the n x x n y numerical integration to compute the matrix A in eq. H15JI. Red 
denotes negative values, K2b(x, y, v) < 0, and blue denotes positive values, K2n(x, y, v) > 0. The cross- 
path tapering of the kernels as a result of the finite time-domain taper, eqs ljA6(l - l|A7(l . is clearly visible. 
Second from right: The sum (over all source-receiver pairs) of the absolute value of the lowest frequency 
[y pa 0.01 Hz) integrated kernels (as computed in ea. I15f) . Far right: The sum (over all source-receiver 
pairs) of the absolute value of the highest frequency (y w 0.1 Hz) integrated kernels (as computed in 
eq. 1151) . The coverage is adequate in the vicinity of the East African rift (by design of the original seismic 
deployment) but poor elsewhere. 
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Figure 6: Spatial-domain structure of the 2D dual-tree complex wavelets used in the reconstruction (figure 
taken/adapted from [Selesnick et al.(2006)| ). First row: real part, second row: imaginary part, third row: 
norm squared (i.e. sum of the squares of the top two plots). The directional character of each of the six 
wavelet functions is clear. Four different wavelet scales, i.e. four different replicas of this picture, each a 
factor of two smaller than the one above it, are used in both the l\ and 1% wavelet-basis inversions. 



For our purposes we will make use of the overcomplete 2D wavelet basis described by IKingsbury(2002) 
and Selesnick et al.(2006)| because of its ability to distinguish different directions (sec Fig. |SJ). We use 



four wavelet scales, for a total of 4 x 64 2 = 16 384 wavelet and scaling coefficients w (four times the 
number of model coefficients m). The starting point for the iterations in both the £\ and £2 inversions 
is w = and m = 0. As explained in the previous section we renormalize the £\ thresholded iteration 
by choosing a — Ama( 2 (which in our case equals 4884.5) where A max is the largest eigenvalue of A T A. 
We let the iteration run for 1000 steps, adjust the data (two-step procedure) and let the second-step 
algorithm run for another 1000 steps. The threshold r is chosen by hand in such a way as to arrive at 
a final value for the variance-adjusted misfit, \ 2 — l|d — Aml^/cr 2 , that is approximately equal to 1848 
(the number of data). The noisy data d are thus fit to within their standard errors a and no better; 
pushing the fit beyond this would amount to fitting the noise e, which would lead to undesirable artifacts 
in the resulting model m. 

It should also be noted that the thresholding is done on pairs of wavelet coefficients: The wavelets 
come in pairs (at the same scale, position and orientation) that we interpret as real and imaginary part 
of a complex wavelet, i.e. thresholding corresponds to (w™, w™) — > z = w^ + iw 1 ™ — ► z = zS T (|z|)/|2:| — > 
(Re(f), Im(z)). This particular method of thresholding is borrowed from image denoising where it 
is found to make a big difference in avoiding artifacts |Guleryuz(2006)| van Spaendonck et al.(2003) 



Selesnick et al.(2005) . Furthermore, the threshold for the diagonally oriented wavelets is multiplied by 
1.2395 because ||V'i/'±45°||i = 1.2395||VV>other||i- We choose the threshold r s for the scaling coefficients to 
be l/10th of the threshold r w for the wavelet coefficients; since the scaling coefficients correspond to a 
few large-scale averages (64 in our case versus more than 16 000 finer-scale wavelet coefficients) it is not 
so important that these be sparse. Likewise, in the wavelet-basis £2 inversions, we set the penalization 
parameter for the scaling coefficients to l/10th the value of the penalization parameter for the wavelet 
coefficients. 

The two-step £\ algorithm takes about ten minutes for 1000 + 1000 iterations on a 1.5GHz PC. The 
result of the £\ inversion is compared with the outcome of both of the £2 methods, with and without 
using wavelets, with the thresholding or penalization parameter r chosen in every case to achieve the 
same data fit: \ 2 ~ 1848 (see Fig. [7J). The number of Landweber iterations is 2000, so that the total 
number of two-step l\ and single-step £2 iterations is the same. The spatial-domain ^-regularization 
method yields a relative modeling error ||m— m mput ||2/||m mput ||2 of about 74%, whereas the two-step £\ 
method yields a relative modeling error of only 47% (see Fig. |HJ), and is clearly less noisy (compare the 
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Figure 7: Graph of the variance-adjusted data fit % 2 = ||d — Am||| /a 2 versus the number of iterations: 
two-step £i-regularization method (solid line), spatial-domain £ 2 method (dashed line) and wavelet-basis 
£2 method (dotted line ). The thresholding and penalization parameter r has in each case been tailored 
so that the final value of \ 2 ^ after 1000 + 1000 or 2000 iterations, is equal to the number of data, namely 
1848. Note the improvement in the rate of convergence toward the model with \ 2 — 1848 after the 
implementation of the second step in the two-step £\ iteration. 

middle two maps in Fig.^l. The wavelet-basis ^-regularized inversion (rightmost map in Fig. [3} is only 
slightly less noisy, with a relative modelling error of about 55% (Fig. 0. One feature that can never be 
recovered in any of the reconstructions is the southern part of the rift, which does not lie between any 
station-event pair. 

In Fig. [5] we compare the wavelet coefficients w of the input model, the two-step £\ reconstruction 
and the wavelet-basis £2 reconstruction. In accordance with our basic assumption, the ^i-regularized 
model is sparse in the wavelet basis. Most of the small-scale coefficients w are zero — in agreement with 
the original model on the left — indicating the effectiveness of the iterative thresholding algorithm. The 
wavelet coefficients of the wavelet £2 reconstruction are clearly not sparse. Also this solution seems to 
suffer from large-scale artifacts (see Fig. rightmost map). 

In the leftmost plot in Fig.^Jwe show \ 2 versus ||w|| 1 tradeoff curves for the l\ reconstruction method, 
both with and without using the two-step procedure. After 1000 + 1000 iterations, the £\ wavelet norm 
||w||i of the two-step reconstructed model is lower — for the same value of x 2 — than the corresponding 
norm of the model produced by 2000 iterations of the first step, with no subsequent redefinition of 
the data d and reiteration. This is an indication that 2000 total iterations is inadequate to achieve full 
convergence, since the fully converged model, which minimizes the functional I\ (w) given in eq. , must 
be the minimum-norm model for a fixed value of \ 2 by definition. A much larger number of iterations 
seems to be required to guarantee convergence. To construct the second set of tradeoff curves in Fig.HHI 
we employed 150 000+ 150 000 iterations in the two-step case and 300 000 in the single-step case; such a 
large number would be prohibitive in any larger-scale, more realistic, 3D application. We have chosen to 
limit the iteration counts to 1000 + 1000 or 2000 in all of our model-space comparisons, since any changes 
in the spatial-domain features of the models m are barely discernible to the eye with further iteration. 
The rightmost plot in Fig. shows the principal advantage of using the two-step iteration procedure: 
for the same total number of iterations, either 1000 + 1000 = 2000 or 150 000 + 150 000 = 300 000, the 
number of nonzero wavelet coefficients of the two-step models is always lower than the corresponding 
number for the single-step models. The two-step £\ procedure therefore leads more quickly to a sparser 
wavelet-basis solution, as expected. 

We also compared the single-step and two-step l\ inversion methods with the corresponding £2 re- 
construction methods, both with and without wavelets, for a number of other input synthetic models. 
These include three checkerboard patterns of decreasing scale and a model similar to the geologically 
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Figure 8: Graph showing the relative modeling error ||m — m mput ||2/||m mput ||2 versus the number of 
iterations: two-step £\ method (solid line), spatial-domain £2 method (dashed line) and wavelet-basis 
£2 method (dotted line). The £i-regularization method clearly yields the most faithful reconstruction of 
the input model m mput . Note the (slight) improvement in the rate of decrease of the modelling error 
following the start of the second step in the two-step £\ iteration. 
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Figure 9: Graphical display of (the modulus of) the wavelet and scaling coefficients. Left: coefficients 
of the synthetic input model m mput . Middle: coefficients of the two-step £\ model after 1000 + 1000 
iterations. Right: coefficients of the wavelet-basis £2 model after 2000 iterations. The four wavelet scales 
are plotted, smallest to largest, top to bottom. Each row shows the six different wavelet directions, plotted 
next to each other in the same left-to-right order as the wavelets plotted in Fig. [U Scaling coefficients 
are plotted on the bottom row. White denotes a zero coefficient, Wi — 0. Each rectangle corresponds to 
the spatial domain 25°E - 50°E by 15°S - 20°N. 



11 



2300 
2200 
2100 
2000 



e — one-step 
i two-step 




300 000 
iterations 



12 

Hli 



e — one-step 
two-step 




50 100 150 200 250 

number of non-zero wavelet coefficients 



Figure 10: Left: Data misfit x 2 versus t\ wavelet norm ||w||i tradeoff curve. Right: Alternative tradeoff 
curve showing \ 2 versus the number of nonzero wavelet coefficients of the model m. Different values 
of the thresholding parameter t were used to determine each point on the various curves. Circles: one- 
step method, after either 2000 or 300 000 iterations; crosses: two-step method after an equivalent number 
(either 1000 + 1000 or 150 000+150 000) of total iterations. The relative positions of the one-step and two- 
step curves suggests that 300 000 total iterations is sufficient to achieve full convergence. The horizontal 
dotted lines show the statistically meaningful value of the noisy data misfit, x 2 = 1848 (the number of 
data). 



inspired one in Fig. 0] but with a more curvaceous low- velocity rift (see Fig. Illfl . Both the single-step l\ 
reconstructions and the £2 reconstructions are computed using 2000 iterations, whereas the two-step l\ 
models are computed using 1000+ 1000 iterations. In all cases, the two-step l\ models are the most par- 
simonious and therefore to most geoscientists the most acceptable. One could consider using smoothness 
damping to improve the quality of the £2 images; however, this would be done at the cost of resolving 
the sharpness of the rift structure. A nitpicker could perhaps also argue that the "rift" structure in the 
model produced by the i\ procedure extends further northwards, albeit diminished in amplitude, whereas 
conventional £2 regularization without wavelets exhibits a sharper cutoff, more like the input model. It 
achieves this sharp cutoff, however, at the expense of many artifacts elsewhere, especially along dominant 
ray directions. Perhaps the most noteworthy feature of the l\ regularization method is its suppression 
of the artifacts resembling high-frequency kernel images that are streaked along surface-wave raypaths in 
all the li models, to the north of the rift and within the craton. This is one of the most serious artifacts 
that plague conventional seismic tomography: £2 regularization frequently if not always seems to enhance 
the well-sampled regions of the model. The i\ wavelet-basis reconstructions show no signs of this familiar 
deficiency. 

The computational bottleneck in the present 2D synthetic study is not the wavelet transform — which 
is fast, certainly on a model m of modest dimension 64 x 64 — or even the number of iterations, but it 
is simply the size of the matrix A. A significant amount of time is needed to accurately pre-compute A, 
and considerable memory is needed to store the computed elements in memory; this is necessary because 
the product A T A is used in every step of the iteration. Doubling of the resolution in every direction 
results in a fourfold increase in size of the model m, and a sixteen-fold increase in the number of elements 
in the square matrix A T A. All calculations were performed using Matlab; software for the 2D dual-tree 
wavelets was downloaded from Sclcsnick ct al.(2006) . 
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Figure 11: Results of applying different reconstruction techniques to a number of different 2D toy models. 
From left to right: Original input model; l\ reconstruction (2000 iterations, single-step procedure); l\ 
reconstruction (1000+ 1000 iterations, two-step procedure); 1% reconstruction (2000 iterations, without 
wavelets); £2 reconstruction (2000 iterations, with wavelets). 



4 Conclusions 



We tested several new methods of regularization through wavelet decomposition of a toy 2D tomographic 
problem characterized by both smooth and sharp velocity anomalies. A variety of synthetic inversion 
experiments show that minimization of the fi-norm of a wavelet decomposition of the model leads to 
tomographic images that are parsimonious in the sense that they use only a few wavelets and still represent 
both smooth and sharp features well without introducing significant blurring or artifacts. The ^i-norm 
performs significantly better than an I2 regularization on cither the model or its wavelet decomposition. 
In particular, raypath-associated artifacts are almost completely suppressed. 

The choice of dual-tree complex wavelets in 2D, representing six space directions, is sufficient to 
avoid directional bias, and efficient in modeling both smooth features such as the cratonic structure as 
well as sharp features such as the rift structure in our simplified synthetic model. Numerical comparisons 
between the inversion results and the input model used to generate the data confirm the superiority of the 
^i-norm regularization. Though in real-world inversions such ground-truth information is not available, 
one can argue that the l\ inversion method serves the principle of parsimony well and is to be preferred 
over more common methods. If the tomographic object (such as the real earth) is too complex to be well 
represented by a parsimonious expansion in wavelets, neither method is able to resolve such complexity 
adequately with a limited data set, as shown in the bottom rows of Fig. ^] where even the £1 inversions 
begin to show the effects of raypath distribution. In this case, we expect that the principle of parsimony 
can be usefully applied once a richer family of building blocks is considered. 

The only drawback of the method, so far, is the slow convergence of the i\ surrogate-functional 
iteration procedure. Our preference for the thrcsholded algorithm used here arises from the fact that 
its convergence is guaranteed even though the t\ problem is nonlinear. We have introduced a two-step 
procedure that leads to a significant speedup; however, Fig.^Jindicates that even 1000+1000 iterations do 
not suffice for complete convergence (it nevertheless produces an excellent approximation) . A potentially 
promising approach towards further convergence improvement is to combine an efficient linear method 
(such as e.g. conjugate-gradient) with an adaptive thresholding scheme. This would then avoid the need 
to precompute the largest eigenvalue of A T A and facilitate the application of the l\ method to a larger, 
3D, study of body- wave tomography. 
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Figure 12: Schematic map view of the single-scattering geometry in our simplified 2D, flat-earth, surface- 
wave inversion problem. The quantity I is the horizontal epicentral distance between the source and 
receiver; I' and I" are the lengths of the first and second legs of the detour path, respectively, and the 
angle rj measures the deflection of the wave at the scatterer. 
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A Two-dimensional sensitivity kernels 

The toy linear inverse problem Am = d used in this paper is designed to incorporate all the impor- 
tant characteristics of a real-world regional tomographic inversion, while at the same time being small 
enough to allow for repeated experimenting with reasonable CPU times on a single workstation. For 
this reason, we limit attention to surface-wave dispersion data, specifically perturbations Sk(v) in the 
wavenumber k{y), presumed to be measured in rad/m, of the fundamental (n — 0) Rayleigh mode at 
temporal frequency v, measured in Hz. Finite-frequency theory based upon the Born approximation 
[Zhou et al.(2004) gives a linear relationship between such wavenumber perturbations and the 3D per- 



turbations in the fractional shear-wave velocity Jln/3(x) within the earth: 

6k(u) =111 K 3D (x,v)5\nl3(x) d 3 x. (Al) 



Making use of a number of flat-earth approximations that do not fundamentally affect the nature of the 
inverse problem, we can write the 3D Frechet sensitivity kernel K^d(x, v), for the simplest case of an 
explosive source with an isotropic radiation pattern and a measurement made on the vertical component 
at the receiver, in the form 

AT 3 d(x, v) = [e (z, v) + ei(z,v)cosr] + e 2 (z, u) cos 2n] \ ^^(~jfj7j7n sm[k(v)(l' + I" - I + tt/4)] , 

(A2) 

where z is the depth, I is the epicentral distance measured in m on the surface of the earth, and I' and I" 
are the horizontal distances of the scatterer x = (x, y, z) from the source and receiver, respectively. The 
quantity 77 is the scattering angle, measured at the surface projection (x,y) of x, as shown in Figure IT21 
Expressions for the depth-dependent functions e (z, v), e\(z, v) and e 2 {z, v) can be found in the appendix 
of |Zhou et al. (20041 . 
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To simplify matters even further, we assume that the velocity perturbation 51n/3(x) is independent 
of depth z and dependent only upon the horizontal Cartesian coordinates x and y. Upon integrating the 
factors ea{z,v), e±(z, v) and e2(z,v) over depth, 

/•CfO poo />oo 

Eq{v) = I e (z,v)dz, Ei(u)= / ei(z,v)dz, E 2 {v) = I e 2 (z,u)dz, (A3) 
Jo Jo Jo 

we may then relate 8k(y) to 5\u(3{x, y) via a 2D sensitivity kernel: 

Sk(u) = JJ K m (x,y,v)5lnf3(x,y)dxdy, (A4) 

where 

K 2D (x,y, v) = [Eo(v) + Exiv) cos v + E 2 {v) cos 2??] ( —l^-^ j ' wa[k(y){l' + l" — 1 + tt/4)] . (A5) 

The rapidly oscillating sinusoidal function sin[fc(f)(Z' + I" — I + tt/4)] in eq. (|A5(I is constant on ellipses, 
I' + I" = constant, having the surface projections of the source and receiver as foci. The cos rj and 
cos 2r] dependence and the term involving the integrable singularity 1 / \Jl'l" act to slowly modulate this 
dominant elliptical dependence. 

Eqs (|A4|) and (| A5|) are valid, subject to the already noted approximations, for a monochromatic 
wavenumber perturbation 5k{y), whereas actual surface- wave dispersion measurements must of necessity 
be made on a portion of a seismogram of finite length, typically multiplied by a time-domain taper h(t). 



Zhou et al.(2004) show that the effect of such a finite-length taper can be accounted for by modifying 
the taper as follows: 

K 2B (x, y, v) -» K 2D (x, y, v) h{(l' + l")/C{v)), (A6) 

where C(y) is the the group velocity at frequency v measured in m/s. This modification has the effect of 
limiting the cross-path width of the Frechet kernel K (x, y, u), since h(t) — for large detour times. We 
assume the data Sk(y) have been measured using a Hann or cosine taper, of duration five wave periods 
centered on the group arrival time: 

!0 for t < tarrival — 2.5/^ 

\ [1 - COS 2liv(t - t arriva l - 2.5/ v)] for ^arrival -2.§/v <t< tarrival + 2.hjv (A7) 

for t > t arr i V al + 2. 5/u 

where t ar rivai = ljC{y). Since V + 1" > I only the t > t arr ivai portion of the taper (|A7(I contributes to the 
finite-record-length sensitivity kernel (|A6|) . 

The group velocity C(v), unperturbed wavenumber k{y) and auxiliary variables Eq(v), E\(i>) and 
E 2 (y) for fundamental- mode Rayleigh waves are listed in Table [21 at the eight selected frequencies v\ 
the corresponding wave periods vary roughly between 100 and 10 s. Since Eq(u), E\(y) and E 2 {y) 
are all negative, a positive velocity perturbation, <51n/3(a;,y) > 0, gives rise to a negative wavenumber 
perturbation, 5k(v) < 0, i.e. an apparently longer wavelength wave, as expected. See Fig. for two 
examples of sensitivity kernels K 2 Y>{x,y,v) computed in this way. It is noteworthy that a 2D surface- 
wave inversion based upon eqs (|A4l) - (|A7p differs from the common approach of inverting for a 2D phase 
velocity map at a single specified frequency v. such maps are strictly incompatible with the notion of 
finite frequency, where no local phase velocity can be defined except when very crude approximations are 



made; for a discussion of this issue see Zhou et al.(2004) . 

B Notes on wavelets 

The basic building block of the ID discrete wavelet transform (DWT) is a filter bank. It consists of a 
high-pass filter g (i.e. a generalized difference) and a low-pass filter h (i.e. a generalized average) that 
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Table 2: Parameters v, C(v), k(v), E${v), E\ (y) and Ei(y) needed to compute the simplified 2D sensitivity 
kernels K2T>(x,y 1 v). Fundamental-mode Rayleigh-wave measurements 5k(v) are presumed to have been 
made at eight frequencies ranging between v w 0.01 Hz (100 s period) and v w 0.1 Hz (10 s period). 
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Figure Bl: Left: Schematic representation of a perfect-reconstruction filter bank that can be used to 
decompose or reconstruct a ID signal x. Right: A standard wavelet tree. 



are applied to a given signal m (i.e. a list of numbers) in the following way: m is convolved with g and 
downsampled, m is convolved with h and downsampled. This results in two signals, each with half the 
length of the original one. The process can be inverted by upsampling (inserting zeroes) the two resulting 
sequences and convolving each with two (carefully matched) filters g and h and then adding the two. A 
traditional way of representing this procedure is shown in the left of Fig. IB II It turns out that there 
exist finite filters that give rise to perfect reconstruction (these use finite convolutions only and lead to 
compactly supported wavelets); moreover in some very special cases, one can have that finite g and h are 
the reverse of g and h (corresponding to compactly supported orthogonal wavelets). The Haar wavelets 
have g = — i) and h = (5, i), but there exist longer (perfect reconstruction) finite filters (which give 
rise to smoother wavelets). The so-called D4 wavelets correspond to h = (1 + V3, 3 + \/3, 3 — y/3, 1 — 
V3) /4V2 and g = (1 - V3, -3 + y/3, 3 + V3, -1 - V3)/4\/2. 

The ID discrete wavelet transform is defined by the iteration of the analysis filter bank on the low- 
pass outcomes (see right side of Fig. IBljl . In this way, successive levels of detail are stripped of the input 
signal m (and stored in wavelet coefficients), leaving a very coarse average (stored in so-called scaling 
coefficients). This construction is called a wavelet tree. It not only defines the DWT but also provides 
its practical implementation. When using finite filters, the construction automatically gives rise to a 
computationally efficient algorithm: as a result of the subsampling each step cost only half as much time 
as the previous one. The total number of operations then is kN + kN/2 + kN /A + kN/8 + . . . = 2kN, 
less than the 0(N 2 ) for a generic linear transformation. 

A standard way of generating wavelets in 2D is to form the direct product of ID wavelets, i.e. the 
filters are applied to rows and columns of an image (lo-lo, lo-hi, hi-lo and hi- hi). This, however, has the 
marked disadvantage of poor directional sensitivity. In this study, to obtain better directional sensitivity, 
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Figure B2: Partitioning of the 2D Fourier domain (k x ,k v ) by the supports of the Fourier transform of 
wavelet functions. Only two wavelet scales are shown with the finest one on the outside. In practice the 
supports have smooth (overlapping) tapers. Left: With the complex 2D wavelets used in this paper, all 
squares come in pairs giving rise to six dominant directions (as indicated by the hatch patterns). Right: 
The standard (direct product) 2D construction only has horizontal and vertical sensitivity; the 'corner' 
(hi-hi) squares encode both 45° and —45° at the same time. 

we use the complex 2D wavelets developed by |Kingsbury(2002j|. The se are constructed also by direct 
product but from two simultaneous wavelet trees (see |Kingsbury(1999)| for a diagram of such a dual tree). 
The qualitative difference between these two constructions is best seen in the Fourier domain. Fig. IB2I 
shows a schematic representation of the supports of the Fourier transforms of the wavelet functions, both 
for the usual 2D wavelet construction and for the 2D complex wavelets. The two are fundamentally 
different: whereas the usual separable 2D wavelet construction gives rise to a horizontal, a vertical and 
one (!) diagonal part at each scale, the complex 2D construction has six different inherent directions per 
scale. A careful choice of the different filters also leads to an (almost) tight frame (i.e. the inverse wavelet 
transform almost coincides with the transpose). 

The price to pay for these benefits is the redundancy. In 2D the complex wavelets generate four 
times as many coefficients as there are pixels in the original image (two trees and real and imaginary 
parts of the output). E.g. the 64 x 64 spatial-domain images we use in Section 01 give rise to 16 320 = 
2 x 6 x (32 2 + 16 2 + 8 2 + 4 2 ) wavelet coefficients and 64 = 2 x 2 x 4 2 scaling coefficients (see e.g. Fig.|5J). 
Together this is 16 384 which equals 4 x 64 2 . 
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